[CITASA] seeking suggestions about research methodology for online forums

Thomas M. Lento thomas.lento at gmail.com
Thu Feb 25 20:54:29 CST 2010


There are all kinds of supervised and semi-supervised content analysis
approaches that might be applicable - their effectiveness depends on the
nature of your data, the level of detail you require, and your ability to
design training data. For example, if you have a superset of health
information topics and keywords associated with them you could probably do a
simple keyword analysis to give you an idea of which posts are talking about
which topic. The advantage is that it's pretty easy to do even if the data
is stored as text blobs in your MySQL database. However, if you need to do
serious contextual analysis or you don't already know the scope and range of
possible health topics and keywords that's a more difficult problem. You'll
probably need to try some different machine learning approaches and see what
works best for your needs. You can do literature searches for papers
published at ICWSM, KDD, or WWW for examples of some approaches. There are
other conferences that will be applicable, but those are good places to
start. Follow the citation trail from the relevant pieces and you should get
some idea of where to look for useful references.
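
A minimal sketch of the simple keyword approach described above, assuming you
already have a topic-to-keyword mapping. The topics, keywords, and sample
posts here are hypothetical placeholders, not taken from any real forum.

```python
import re

# Hypothetical topic -> keyword lists; a real project would build these
# from its own superset of health topics.
TOPIC_KEYWORDS = {
    "diabetes": ["diabetes", "insulin", "blood sugar"],
    "heart": ["heart attack", "blood pressure", "cholesterol"],
}


def tag_topics(text, topic_keywords=TOPIC_KEYWORDS):
    """Return the sorted list of topics with at least one keyword in text."""
    lowered = text.lower()
    found = set()
    for topic, keywords in topic_keywords.items():
        for kw in keywords:
            # Word boundaries keep a keyword from matching inside a longer word.
            if re.search(r"\b" + re.escape(kw) + r"\b", lowered):
                found.add(topic)
                break
    return sorted(found)


# Hypothetical post bodies keyed by postid.
posts = {
    1: "My insulin dose changed after my blood sugar spiked.",
    2: "Does anyone track their blood pressure at home?",
    3: "Welcome to the forum!",
}
topics_by_post = {pid: tag_topics(body) for pid, body in posts.items()}
```

Posts that match no keyword get an empty list, which is a quick way to see
how much of the forum your keyword lists fail to cover.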

Network analysis of these things can be pretty straightforward,
moderately complicated, or totally impossible. It all depends on how your
data is structured and what type of information is available. If you've got
a standard relational database structure designed to drive the content in
the online community then at best you're in for some work to reformat the
data tables into something useful for actual analysis. What you hope for are
tables with various combinations of userid, postid, time stamp, user_data,
and post_data. You'll probably need to join several tables to get the actual
data files you need, and although MySQL is not optimized for joins it should
still be manageable.
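
As a rough illustration of the kind of join involved, here is a sketch that
uses Python's built-in sqlite3 module as a stand-in for MySQL. The table and
column names (users, posts, threadid, created_at) are assumptions; a real
forum schema will differ, and you would connect with a MySQL driver instead.

```python
import sqlite3

# In-memory SQLite database standing in for the forum's MySQL database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userid INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE posts (
    postid INTEGER PRIMARY KEY,
    userid INTEGER,
    threadid INTEGER,
    created_at TEXT
);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO posts VALUES
    (10, 1, 100, '2010-02-01 09:00:00'),
    (11, 2, 100, '2010-02-01 09:30:00'),
    (12, 1, 101, '2010-02-02 14:00:00');
""")

# Join posts to users so each row carries postid, author, thread, and time.
rows = conn.execute("""
    SELECT p.postid, u.username, p.threadid, p.created_at
    FROM posts AS p
    JOIN users AS u ON u.userid = p.userid
    ORDER BY p.postid
""").fetchall()
```

Each element of rows is one post with its author attached, which is the
shape most later steps (counts per user, reply networks) start from.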

Assuming you're new to this type of analysis, my advice is to spend a lot of
time familiarizing yourself with the underlying data. Do some basic
distributions and see how much noise you've got in the system, find out the
most efficient routes to generating the output you need, and discover what
information is available in which table and how those tables are keyed and
indexed. Make sure you're on the lookout for garbage data - a lot of the
great data you get from various online sources is basically bad, either
because of system errors (rare and usually easy to find) or "bad" users
(common and not so easy to find - there's a whole literature on spam
detection algorithms out there). You'll need to make decisions about your
error tolerance and what types of behaviors you wish to ignore, and you can
only do that effectively if you understand your data.
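
One simple way to start on those basic distributions is to count posts per
user and look at the tail. The sketch below assumes you have already pulled
one author id per post; the threshold of ten times the median is an
arbitrary illustration, not a recommended spam rule.

```python
from collections import Counter
from statistics import median


def posts_per_user(post_authors):
    """Count how many posts each author wrote."""
    return Counter(post_authors)


def distribution(counts):
    """Map 'number of posts' -> 'number of users with that many posts'."""
    return Counter(counts.values())


def flag_heavy_posters(counts, multiple=10):
    """Return users whose post count exceeds `multiple` times the median."""
    cutoff = multiple * median(counts.values())
    return sorted(user for user, n in counts.items() if n > cutoff)


# Hypothetical data: one author id per post.
authors = ["a"] * 2 + ["b"] * 1 + ["c"] * 3 + ["spammer"] * 50
counts = posts_per_user(authors)
dist = distribution(counts)
suspects = flag_heavy_posters(counts)
```

Flagged accounts still need a manual look; some very active users are
legitimate and may be exactly the people you want to study.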

I don't know how much experience you have with database queries, but
assuming you can only handle moderately complex queries my advice is to use
the database as a source for your final dataset and then conduct your
analysis in some other tool. My typical approach to this situation is to
write database queries that produce flat text files with one row per
observation (typically per user, or per user/time_period combination, but
this obviously depends on your research question) with one column for each
metric. Then I load the data into R or Stata or whatever else and build
models. You can do a fair amount of work in MySQL, but this is typically
slower and more difficult than exporting and using an actual statistical
package.
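
A sketch of the flat-file step, assuming the query results arrive as
(userid, threadid, timestamp) tuples. The metric names are made up for
illustration, and the timestamps are assumed to be ISO-formatted strings so
that min and max sort them chronologically.

```python
import csv
import io
from collections import defaultdict


def user_metrics(posts):
    """Collapse (userid, threadid, timestamp) rows into one dict per user."""
    by_user = defaultdict(list)
    for userid, threadid, ts in posts:
        by_user[userid].append((threadid, ts))
    metrics = []
    for userid in sorted(by_user):
        items = by_user[userid]
        metrics.append({
            "userid": userid,
            "n_posts": len(items),
            "n_threads": len({thread for thread, _ in items}),
            "first_post": min(ts for _, ts in items),
            "last_post": max(ts for _, ts in items),
        })
    return metrics


def write_flat_file(metrics, fileobj):
    """Write one tab-delimited row per user, one column per metric."""
    fields = ["userid", "n_posts", "n_threads", "first_post", "last_post"]
    writer = csv.DictWriter(fileobj, fieldnames=fields,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(metrics)


# Hypothetical rows; in practice these come from the database query.
posts = [(1, 100, "2010-02-01"), (2, 100, "2010-02-01"), (1, 101, "2010-02-02")]
buf = io.StringIO()
write_flat_file(user_metrics(posts), buf)
flat_text = buf.getvalue()
```

Passing an opened file instead of the StringIO buffer gives you a
tab-delimited file that R or Stata can import directly.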

The drawback of exporting data is that you're limited to whatever your stats
package can hold in memory. If your data set is large (hundreds of thousands
or millions of observations) then you need a fair amount of memory to run
any kind of complex model. If you're dealing with tens or hundreds of
millions of observations in your model then things get really interesting -
I suggest sampling, but there are other more difficult options.
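
If the export is too big to load at once, one way to sample is reservoir
sampling, which draws a uniform random sample of fixed size in a single pass
without holding the full data in memory. This is only a sketch of one
technique; the function name and the seed parameter are illustrative.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of any size."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Hypothetical stream of observation ids; a real run would iterate over
# lines of the exported file or rows from a database cursor.
sampled = reservoir_sample(range(100000), k=1000, seed=42)
```

Whether simple random sampling suits your research question is a separate
matter; for network measures in particular, sampling rows can distort the
structure you are trying to measure.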

If you don't know MySQL at all, you need to learn it. There are a plethora
of books on MySQL out there - I like O'Reilly for reference and SAMS for
instruction, so if you get something from one of those publishers you should
be ok. You will also want to learn how to do some scripting in Python or
Perl. For the rank beginner, I recommend going with Python and learning by
working through the chapters and exercises in How To Think Like a Computer
Scientist (free online at
http://www.greenteapress.com/thinkpython/thinkCSpy/html/ ). If you know how
to program in general then diveintopython.org is your best bet. I'm not a
Perl guy, but I'm sure someone can point you to resources.

For the actual network analysis, I'd first look into NodeXL since it's easy
to use, and if that doesn't meet your needs I'd go with something like
igraph (see http://igraph.sourceforge.net/ ), which works with both R and
Python.
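
Before loading anything into a network tool, it can help to build the edge
list yourself and check it. The sketch below uses only the Python standard
library, not NodeXL or igraph, and assumes a tie means "user A replied to
user B", which is just one possible definition for forum data.

```python
from collections import Counter


def weighted_edges(replies):
    """Collapse (replier, original_poster) pairs into weighted directed edges.

    Self-replies are dropped.
    """
    return Counter((src, dst) for src, dst in replies if src != dst)


def degree_tables(edges):
    """Return weighted in-degree and out-degree counts per user."""
    in_deg, out_deg = Counter(), Counter()
    for (src, dst), weight in edges.items():
        out_deg[src] += weight
        in_deg[dst] += weight
    return in_deg, out_deg


# Hypothetical reply pairs: (user who replied, user being replied to).
replies = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("bob", "alice"),
    ("alice", "bob"),
    ("alice", "alice"),
]
edges = weighted_edges(replies)
in_deg, out_deg = degree_tables(edges)
```

The resulting edge list, one row per source, target, and weight, is a common
input shape for network packages.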

Best of luck.

-Tom

On Thu, Feb 25, 2010 at 6:14 PM, Tim Hale <timhale at uab.edu> wrote:

> Hi everyone,
>
> I am working with others on a project that will examine health
> communication among members of an online community who post to an online
> forum. We have three primary goals: (1) to conduct a content analysis to
> understand the types of health information that is communicated; (2)
> identify the context of the discussions, including understanding the
> characteristics of the individuals who initiate and disseminate health
> information; and (3) to conduct a social network analysis to examine the
> larger structures of health information sharing among community members.
>
> Although this type of research could be conducted by manually collecting
> posts from the online forum, coding for content, and the creation of a data
> set for social network analysis -- we are interested in other approaches
> that make better use of the forum database files. We have the cooperation of
> the website owner and administrator to access the MySQL database.
>
> I am seeking advice from anyone with experience working on similar research
> questions involving online forums and especially, making use of the original
> forum database files. All recommendations, suggestions, and pointers to
> articles, books, and appropriate tools are welcome and greatly appreciated.
>
> Thank you,
> Tim Hale
>
> ------------------------------------------------------------
> Timothy M. Hale, MA
> University of Alabama at Birmingham
> Department of Sociology
> Heritage Hall 460E
> 1401 University Boulevard
> Birmingham, AL 35294-1152
> 205.222.8108 (cell)
> timhale at uab.edu
>
>
> _______________________________________________
> CITASA mailing list
> CITASA at list.citasa.org
> http://list.citasa.org/mailman/listinfo/citasa_list.citasa.org
>